Biochemistry 



© Copyright 1990 by the American Chemical Society 



Volume 29, Number 37 September 18, 1990 



Perspectives in Biochemistry 



Additivity of Mutational Effects in Proteins 

James A. Wells 

Protein Engineering Department, Genentech, Inc., 460 Point San Bruno Boulevard, South San Francisco, California 94080 
Received April 19, 1990; Revised Manuscript Received May 29, 1990 



The energetics of virtually all binding functions in proteins 
is the culmination of a set of molecular interactions. For 
example, removal of a single molecular contact by a point 
mutation causes relatively small reductions (typically 0.5-5 
kcal/mol) in the free energy of transition-state stabilization 
[for reviews see Fersht (1987) and Wells and Estell (1988)], 
protein-protein interactions (Laskowski et al., 1983, 1989; 
Ackers & Smith, 1985), or protein stability [for review see 
Matthews (1987)] compared to the overall free energy asso- 
ciated with these functional properties (usually 5-20 kcal/mol). 
Thus, it is possible to modulate protein function by mutation 
at many contact sites. In fact, to design large changes in 
function will often require mutation of more than one functional 
residue. 

There is now a large data base for free energy changes that 
result when single mutants are combined. A review of these 
data shows that, in the majority of cases, the sum of the free 
energy changes derived from the single mutations is nearly 
equal to the free energy change measured in the multiple 
mutant. However, there are two major exceptions where such 
simple additivity breaks down. The first is where the mutated 
residues interact with each other, by direct contact or indirectly 
through electrostatic interactions or structural perturbations, 
so that they no longer behave independently. The second is 
where the mutation causes a change in mechanism or rate-limiting 
step of the reaction. It is important to note that the 
additive effects discussed here do not change the molecularity 
of their respective reactions. When the molecularity of the 
reaction changes [as in comparing the free energy of binding 
of one linked substrate (A-B) versus the sum of two fragments 
(A plus B)], large deviations from simple additivity can result 
from entropic effects (Jencks, 1981). Although the focus here 
is on enzyme activity, similar conclusions may be drawn from 
mutations affecting protein-protein interactions, protein-DNA 
recognition, or protein stability. Some practical examples and 
applications are discussed. 

Additivity Relationships 

The change in free energy of a functional property caused 
by a mutation at site X is typically expressed relative to that 
of the wild-type protein as $\Delta\Delta G_{(X)}$. Such free energy changes 
for two single mutants (X and Y) can be related to those of 
a double mutant (designated X.Y) by eq 1 (Carter et al., 1984; 
Ackers & Smith, 1985). 

$$\Delta\Delta G_{(X,Y)} = \Delta\Delta G_{(X)} + \Delta\Delta G_{(Y)} + \Delta G_{I} \tag{1}$$

The $\Delta G_{I}$ term (also called the 
coupling energy; Carter et al., 1984) should reflect the extent 
to which the change in energy of interaction between sites X 
and Y affects the functional property measured. It is possible 
for $\Delta G_{I}$ to be either positive or negative depending upon 
whether the interactions between the mutant side chains reduce 
or enhance the functional property measured. Furthermore, 
the $\Delta G_{I}$ term should not exceed the free energy of interaction 
between side chains at sites X and Y except in cases where 
these mutations cause large structural perturbations. This relationship was 
first applied to evaluating the functional independence of 
residues mutated in tyrosyl-tRNA synthetase (Carter et al., 
1984). In one case the sum of the $\Delta\Delta G$ values for single 
mutants was equal to that of the double mutant, indicating 
the sites functioned independently; in another example there 
was a large discrepancy, suggesting the sites were interacting. 
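
The bookkeeping behind eq 1 can be sketched in a few lines of Python. This is only an illustration: the function names, the 0.5 kcal/mol tolerance, and the numerical values are assumptions chosen for the example and are not taken from the data discussed here.

```python
def coupling_energy(ddg_x, ddg_y, ddg_xy):
    """Solve eq 1 for the coupling term: dG_I = ddG(X,Y) - ddG(X) - ddG(Y)."""
    return ddg_xy - (ddg_x + ddg_y)


def is_simply_additive(ddg_x, ddg_y, ddg_xy, tolerance=0.5):
    """Return True when |dG_I| lies within a chosen tolerance (kcal/mol).

    The 0.5 kcal/mol default is an illustrative assumption, not a value
    given in the text.
    """
    return abs(coupling_energy(ddg_x, ddg_y, ddg_xy)) <= tolerance


# Hypothetical free energy changes in kcal/mol, for illustration only.
independent = coupling_energy(1.0, 0.5, 1.4)   # small coupling term
interacting = coupling_energy(1.0, 0.5, 0.2)   # large coupling term

print(round(independent, 2), is_simply_additive(1.0, 0.5, 1.4))
print(round(interacting, 2), is_simply_additive(1.0, 0.5, 0.2))
```

A coupling term near zero corresponds to sites that function independently; a large term of either sign flags a pair of sites that may be interacting.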

Simple Additivity in Transition-State Binding 
Interactions 

The strengths of noncovalent interactions are strongly dependent 
upon the nature of the two groups and the distance 
(r) between them. For example, the free energies of charge-charge, 
random charge-dipole, random dipole-dipole, van der 
Waals attraction, and repulsion decay as $1/r$, $1/r^{4}$, $1/r^{6}$, $1/r^{6}$, 
and $1/r^{12}$, respectively [for review see Fersht (1985)]. Thus, 
when the side chains at sites X and Y are remote from one 
another and assuming no large structural perturbations, the 
$\Delta G_{I}$ term should be negligible and eq 1 thus simplifies to 

$$\Delta\Delta G_{(X,Y)} \approx \Delta\Delta G_{(X)} + \Delta\Delta G_{(Y)} \tag{2}$$

This situation, here referred to as simple additivity, is generally 
observed except where side chains are close to each other or 
when one or both of the mutants change the rate-limiting step 
or reaction mechanism. These principles are well illustrated 
from data of additive mutational effects on transition-state 
stabilization energies. 
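
To see why remote sites are expected to give a negligible coupling term, the distance dependences cited from Fersht (1985) can be compared numerically. The sketch below is a minimal illustration: the 4 Å reference distance and the function name are assumptions, and the result is a relative strength, not an energy.

```python
# Power-law exponents n for interaction free energies decaying as 1/r**n.
DECAY_EXPONENTS = {
    "charge-charge": 1,
    "random charge-dipole": 4,
    "random dipole-dipole": 6,
    "van der Waals attraction": 6,
    "repulsion": 12,
}


def relative_strength(interaction, r, r_ref=4.0):
    """Interaction strength at distance r relative to that at r_ref.

    Distances are in angstroms; r_ref = 4.0 is an assumed reference
    roughly at the limit of van der Waals contact.
    """
    n = DECAY_EXPONENTS[interaction]
    return (r_ref / r) ** n


# Doubling the separation from 4 to 8 angstroms:
for name in DECAY_EXPONENTS:
    print(f"{name}: {relative_strength(name, 8.0):.6f}")
```

Doubling the distance halves a charge-charge term but reduces a dipole-dipole or van der Waals term 64-fold, which is consistent with electrostatic substitutions being the ones most likely to show long-range departures from simple additivity.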
FIGURE 1: Plot of the changes in transition-state stabilization energies 
for the multiple mutant versus the sum for the component mutants. 
Data are taken from Table I and represent mutants from subtilisin 
(■), tyrosyl-tRNA synthetase (○), trypsin (□), DHFR (●), and 
glutathione reductase (△), where mutant or wild-type side chains 
should not contact one another. The dashed line has a slope of 1, 
and the solid line is a best fit to all the data. 

Changes in transition-state stabilization energy ($\Delta\Delta G_T^{\ddagger}$) 
caused by a mutation can be calculated from eq 3 (Wilkinson 
et al., 1983), in which R is the gas constant, T is the absolute 
temperature, $k_{cat}$ is the turnover number, and $K_M$ is the Michaelis 
constant for the mutant and wild-type enzyme against 
a fixed substrate. 

$$\Delta\Delta G_T^{\ddagger} = -RT \ln \frac{(k_{cat}/K_M)_{\mathrm{mutant}}}{(k_{cat}/K_M)_{\mathrm{wild\text{-}type}}} \tag{3}$$

$\Delta\Delta G_T^{\ddagger}$ represents the change in free energy 
to reach the transition-state complex (E·S‡) from the free 
enzyme and substrate (E + S). 
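
As a worked illustration of eq 3, the following Python sketch converts a ratio of catalytic efficiencies into a change in transition-state stabilization energy. The temperature of 298.15 K, the function name, and the example ratios are assumptions made for the illustration.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol K)


def ddg_transition_state(kcat_km_mutant, kcat_km_wild_type, temperature_k=298.15):
    """Eq 3: ddG_T = -RT ln[(kcat/KM)_mutant / (kcat/KM)_wild-type], in kcal/mol.

    The default temperature is an assumed 25 degrees C.
    """
    ratio = kcat_km_mutant / kcat_km_wild_type
    return -R_KCAL * temperature_k * math.log(ratio)


# Hypothetical example: a mutant with 10-fold lower kcat/KM than wild type.
tenfold_loss = ddg_transition_state(1.0, 10.0)
# Two independent 10-fold losses combine to a 100-fold loss.
hundredfold_loss = ddg_transition_state(1.0, 100.0)

print(round(tenfold_loss, 2))
print(round(hundredfold_loss, 2))
```

Because the free energy depends on the logarithm of the ratio, effects that multiply at the level of $k_{cat}/K_M$ add at the level of $\Delta\Delta G_T^{\ddagger}$, which is what makes the additivity test of eq 2 possible.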

To analyze the proposition that the interaction energy term, 
$\Delta G_{T(I)}^{\ddagger}$, is relatively small when the sites of mutation (X and 
Y) are remote from one another, $\Delta\Delta G_T^{\ddagger}$ values were collected 
from the literature where side-chain substitutions in the 
multiple mutant are beyond van der Waals contact (>4 Å 
distant) from each other (Table I). There are at least 25 
examples distributed across five different enzymes where 
$\Delta\Delta G_T^{\ddagger}$ values can be calculated for the individual and multiple 
mutants assayed in at least two different ways. Among these 
are examples where electrostatic interactions, hydrogen 
bonding, and steric and hydrophobic effects have been altered 
separately or in combination with others. The X-ray structures 
of the wild-type proteins show that the wild-type side chains 
are not in contact. Modeling suggests the mutant side chains 
are beyond possible van der Waals contact unless the mutant 
side chains were to cause significant changes in the overall 
protein structure. Such large changes are rarely observed in 
structures of site-specific mutant proteins (Katz & Kossiakoff, 
1986; Alber et al., 1987; Howell et al., 1986; Wilde et al., 
1988) or even highly variant natural proteins (Chothia & Lesk, 
1986). 

A collective plot of the sum of the $\Delta\Delta G_T^{\ddagger}$ values for the 
component mutants versus the corresponding multiple mutant 
(Table I) gives a remarkably strong correlation ($R^2$ = 0.92) 
with a slope near unity (Figure 1). The simplest interpretation 
is that the interaction term, $\Delta G_{T(I)}^{\ddagger}$, is small compared to the 
overall effects on $\Delta\Delta G_{T(X,Y)}^{\ddagger}$. It is formally possible that there 
are large and compensating effects between side chains X and 
Y that systematically lead to small net values for $\Delta G_{T(I)}^{\ddagger}$. 

There are some notable exceptions that weaken the correlation 
within the data set (Table I). In particular, combining 
the R204L mutation in Escherichia coli glutathione reductase 
with other mutations gives a less than additive effect, especially 
when it is combined with R198M (Scrutton et al., 1990). These basic 
residues are not in direct contact, but both side chains form 
a salt bridge with the 2'-phosphate group of NADPH. Indeed, 
the largest discrepancies are when these mutants are assayed 
with NADPH as compared to NADH. Similarly, the sum 
of the $\Delta\Delta G_T^{\ddagger}$ values for two positively charged component 
mutants in subtilisin (D99K and E156K) overestimates the 
effect of the multiple mutant when assayed with an Arg but 
not with a Phe substrate (Russell & Fersht, 1987). Such 
discrepancies are not too surprising because charge-charge 
interactions fall off as $1/r$ and can exhibit long-range effects 
in proteins [for example, see Russell and Fersht (1988)]. The 
physical basis for other large discrepancies not involving 
electrostatic substitutions is less clear but may involve unexpectedly 
large structural changes or changes in enzyme 
mechanism (see below). 

These additivity tests are not particularly dominated by one 
of the single mutants in the sum. The average contribution 
(±SE) for the most dominant mutant in each sum calculated 
from the 69 additivity tests given in Table I is only 68% 
(±15%) of the total sum (theoretical is ~50%). Furthermore, 
the plot in Figure 1 is not analogous to graphs of correlated 
variables, where A is plotted versus the sum of A + B, because 
in Figure 1 the values on the y-axis are determined independently 
from those on the x-axis. 

Complex Additivity in Transition-State 
Stabilization: When $\Delta G_{T(I)}^{\ddagger} \neq 0$ 

(A) Change in Interaction Energy between Sites X and Y. 
Where residues X and Y are close enough to contact, it is more 
likely that the $\Delta G_{T(I)}^{\ddagger}$ term will be significant. There are 11 
examples collectively from tyrosyl-tRNA synthetase and 
subtilisin that fit this category (Table II). 

A series of mutants in tyrosyl-tRNA synthetase at positions 
48 and 51 (Carter et al., 1984; Lowe et al., 1985) show complex 
additivity (Table II). His48 and Thr51 in the wild-type 
structure are next to each other on adjacent turns of an α-helix. 
His48 hydrogen bonds to the ribose ring oxygen of ATP while 
Thr51 can make van der Waals contact with ATP. The T51P 
mutation increases the catalytic efficiency of the enzyme in 
some assays by more than -2 kcal/mol (Wilkinson et al., 
1984). However, when this mutation is combined with mu- 
tations at position 48, the effects are not simply additive. An 
X-ray structure of the T51P mutant indicates there are no 
structural changes in the α-helix (Brown et al., 1987). Instead, 
it is suggested that the T51P mutant is improved over wild 
type because the wild-type enzyme contains a bound water in 
the vicinity of Thr51 that disfavors substrate binding. Blow 
and co-workers (Brown et al., 1987) argue that the change 
in solvent structure propagated to position 48 may account for 
the complex additivity. In the previous section, the double 
mutant (H48G.T51A) exhibited nearly simple additivity 
(Table I). Presumably, the smaller and less hydrophobic 
alanine substitution at position 51 should not introduce as large 
a change in solvent structure as the pyrrolidine ring of proline. 

In the case of subtilisin (Table II), Glu156 is near the top 
of the P1 binding crevice while Gly166 is at the bottom. In 
the wild-type enzyme these sites do not make direct van der 
Waals contact, but large side chains substituted at position 
166 can be modeled to contact the residue at position 156. In 
fact, X-ray structural analysis shows that an Asn side chain 
at position 166 makes a good hydrogen bond with Glu156 
(Bott et al., 1987). Moreover, all of the substitutions are polar 
or charged, the energetics of which are expected to be the most 
long range. Thus, the mutant side chains alter substantially 
the intramolecular interactions between positions 156 and 166. 






Table I: Comparison of Sums of $\Delta\Delta G_T^{\ddagger}$ from Component Mutants vs the Multiple Mutant Where the Mutant or Wild-Type Side Chains Should Not Contact One Another 

mutants / assay        component mutants        sum        multiple mutant 





Tyrosyl-tRNA Synthetase 






Subtilisin BPN' 








C35G + H48G' 








D99K + E156K 








ATP/PP, 






+2.30 


R 


+1.29 +2,12 




+3.41 


+2.74 


ATP/tRNA 


+ U05 + L13 


+2.18 






+0.13 -0.49 








Tyr/PP, 
Tyr/tRNA 


+ 1.14 +1.12 


+2.26 


+2J2 




E156S, 








+0,32 +1.12 
C35G +T51P 


+ 1.45 


+1.20 




G166A + G169A, 
Y217L' 




-1.86 
-0.09 


-1.76 
+0.02 


ATP/tRNA 


+ 1.20 -1.91 
+ 1.05 -2.35 


-0.71 
-1.30 


-K88 


F 
Y 


-0.40 -1.46 
+0.94 -1.03 
S24C, 
G166A+ H64A 
-0.40 +4.96 




Tyr/PP, 


+ 1.14 -0.64 


+0.50 


-0.74 










Tyr/tRNA 


+0.32 +0.50 
C35G + T5IC 4 


+0.82 


+0.21 


F 




+4.56 


+4.11 


ATP/tRNA 


+ 1.05 -0.93 


+0.12 


-0.22 


Y 


+0.94 +4.40 
E156S, ciar 
GI69A.+ u&iA 
Y217L HMA 

-K03 +4^40 
+G166A 




+5.34 


+5.84 


ATP/Tyr 


+ 1.14 -0.91 
H48N +T51A' 


+0.23 


-0.13 










ATP/PP, 
ATP/tRNA 


+0.26 -0.38 
-0.13 -0.32 
T40A + H45G' 


-0.12 
-0.45 


+0.04 
-0.37 


F 
Y 




+3.50 
+3.37 


+4.21 
+3.96 


Tyr/Tyr 


+5.02 +3.15 


+8.17 


+6.95 










ATP/Tyr 


+5.13 +2.44 

Rat Trypsin 


+7.57 


+6.67 










K 
R 


G2I6A + G226A* 
+2.75 +3.13 
+2.19 +4.91 


+5.88 
+7.10 


+5.90 


F 


+4.21 -0.40 
+3.96 +0.94 
S24C, E156S, 
H64A, + G169A, 




+3.81 
+4.90 


+3.53 
+6.07 


HjF 


Dihydrofolate Reductase (AAG^,,,,) 
F31V + L54G/ 

+ 1.6 +2.9 +4.5 




F 

Y 


G166A Y217L 

E156S, 
S24C, . G166A. 




+2.65 
+4.81 


+3.53 
+6.07 


MTX 


+2.2 +2.9 


+5.1 














Subtilisin BPN' 








HMA G169A. 

Y217L 
+4.96 -1.7S 










E156S + Y217L + G169A' 


-2.92 




F 




+3J0 


+3.53 




-1.43 -0.87 -0.62 


Y 


+4.40 +0.02 




+4.38 


+6.07 


Q 


-0.60 -0.36 -0.32 


-1.28 


-1*14 








A 


-0.15 -0.41 -0.27 


-0.83 
















+1.70 -0.08 -0.30 


+1.32 


+087 




A179G + R198M' 








M 


-0.86 -0.32 -0.39 


-1.57 


-Ml 


NADH 


-1.10 -0.62 




-1.72 


-1.32 


F 


-0.61 -0.29 -0.66 


-1.56 


-1.17 


NADPH 


+0.08 +2.68 




+2.76 


+2.11 




-0.24 -0.12 -0.41 


-0.77 






A179G + R204L 










E156S + Y217L 








-1.10 +0.41 










-1.43 -0.87 


-2.30 






+0.08 +2.42 










-0.60 -0.36 


-0.96 






R198M+ R204L 










-0.15 -0.41 


-0.56 






-0.62 +0.41 










+ 1.70 -0.08 


+ 1.62 






+2.68 +2.42 










-0.86 -0.32 
-0.61 -0.29 


-1.18 
-0.90 






A,79G+ R204L' 










-0.24 -0.12 


-0.36 






-1.10 -0.51 








E 


S+G169A 
-1.67 -0.62 


-2.29 


-2.06 


NADPH 


+0.08 +3.70 
R198M+A179G. 




+3!78 


+2!22 


Q 


-0.96 -0.32 


-1.28 


-1.14 


NADH 


-0.62 -1.54 




-2.16 


-1.72 


A 


-0.53 -0.27 


-0.80 


-0.92 


NADPH 


+2.68 +0.87 




+3.55 


+2.22 


K 


+ 1.33 -0.30 


+ 1.03 


+0.87 




R204L+ A17 'G. 
RZ04L+ R198M 








M 


-1.11 -0.39 


-1.50 


-1.41 










F 


-0.84 -0.66 


-1.50 


-1.17 


NADH 


+0.41 -1.32 




-0.91 


-1.72 


Y 


-0.32 -0.41 


-0.73 


-0.59 


NADPH 


+2.42 +2.11 




+4.53 


+2.22 




D99S + E156S» 








R179G + R198M+R204L 






R 


+0.47 +0.77 


+ 1.24 


+1.52 


NADH 


-1.10 -0.62 


+0.41 


-1.31 


-1.72 


F 


0 -0.62 


-0.62 


-0.52 


NADPH 


+0.08 +2.68 


+2.42 


+5.18 


+2,22 



… of ATP-dependent pyrophosphate exchange (ATP/PPi) or tRNA charging (ATP/tRNA) 
… for Tyr/PPi exchange and Tyr/tRNA charging. 
Lowe et al. (1985). The ATP/Tyr activation assay refers to formation of tyrosyl adenylate under saturating concentrations of tyrosine. 
Jones et al. (1986). 
Leatherbarrow et al. (1986). The ATP/Tyr and Tyr/Tyr activation assays refer to formation of tyrosyl adenylate under pre-steady-state conditions, and $k_{cat}/K_M$ is calculated from $k_3/K_s$ for tyrosine and ATP, respectively. 
Craik et al. (1985). The substrate was D-Val-Leu-(X)-aminofluorocoumarin where the P1 residue (X) is either Lys (K) or Arg (R). 
Mayer et al. (1986). The ligand was either dihydrofolate (H2F) or methotrexate (MTX). 
Wells et al. (1987a). The substrate was succinyl-L-Ala-L-Ala-L-Pro-L-(X)-p-nitroanilide where the P1 (X) residue (Schechter & Berger, 1967) was either Glu (E), Gln (Q), Ala (A), Lys (K), Met (M), Phe (F), or Tyr (Y). 
Russell and Fersht (1987). The substrate was benzoyl-L-Val-Gly-L-Arg-p-nitroanilide (R) or succinyl-L-Ala-L-Ala-L-Pro-L-Phe-p-nitroanilide (F). 
Carter et al. (1989). The substrate was succinyl-L-Phe-L-Ala-L-His-L-(X)-p-nitroanilide where X was either Phe (F) or Tyr (Y). 
Scrutton et al. (1990). The assay followed the reduction of oxidized glutathione by NADH or NADPH. 






Table II: Comparison of Sums of $\Delta\Delta G_T^{\ddagger}$ from Component Mutants 
vs the Multiple Mutant Where the Mutant Side Chains Can Contact 
One Another 

mutants / assay        component mutants        sum        multiple mutant 





Tyrosyl-tRNA Synthetase 






H48G + T51P* 








+ 1.04 -1.91 


-0,87 




ATP/tRNA 


+ 1.13 -2.35 


-1.22 






+ 1.12 -0.64 


+0.48 




Tyr/tRNA 


+ 1.12 +0.50 


+ 1.63 






+0,95 -1.99 






Tyr/ ATP* 


+ K07 -0J8 


+0^69 


+0^82 


H48N + T51P 






ATP/Tyr 


+0.18 -1.99 


-1.81 






+0.36 -0.38 


-0.02 


^064 


ATP/tRNA 


-0.02 -2.23 


-2,25 






N48G +T51P 






ATP/Tyr 


+0.37 -0.94 


-0.57 




Tyr/Tyr 


+0.41 -1.00 


-0.59 




ATP/tRNA 


+ 1.26 -1.05 


+0.21 




Q48G + T51P 








-1.31 -1.09 


-2.40 






-2.05 -1.65 


-3.70 




ATP/tRNA 


-1.87 -1.85 


-3,72 




H48Q + T51P 






ATP/Tyr 


+2.26 -1.99 


+0.27 




Tyr/Tyr 


+3.13 -0.38 


+2.75 




ATP/tRNA 


+3.11 -2.23 


+0.88 






Subtilisin BPN' 








E156Q + G166iy 






Q 


-1.04 +1.27 


+0.23 






-0.45 +1.83 


+ 1.38 






+2.15 +0.53 


+2.68 






E156S + G166D 






Q 


-0.59 +1.27 


+0.68 






-0.85 +1.83 


+0.98 






+ 1.68 +0.53 


+2.22 






E156Q + GI66N 








-1.71 -0.11 


-1.82 




Q 








M 


-a45 +0J8 


-0^27 


-1.10 


K 


+2.15 +0.48 


+2.73 


+ 1,16 




E156S + G166N 
    E        -1.44  -0.11        -1.55        -0.51 
    Q        -0.59  +0.14        -0.45        -0.85 
    M        -0.85  +0.18        -0.67        -0.78 
    K        +1.68  +0.48        +2.16        +1.26 

E156S + G166K 
    E        -1.44  -3.49        -4.93        -4.49 
    Q        -0.59  -1.03        -1.62        -0.95 
    M        -0.85  -1.37        -2.22        -1.12 
    K        +1.68  +0.51        +2.19        +1.88 

E156Q + G166K 
    E        -1.71  -3.49        -5.20        -4.49 
    Q        -1.04  -1.03        -2.07        -0.95 
    M        -0.45  -1.37        -1.82        -1.12 
    K        +2.15  +0.51        +2.66        +1.88 



See Table I for description of assays. Lowe et al. (1985). Carter et 
al. (1984). Wells et al. (1987b). 



In these six examples there are large and systematic discrepancies 
between the sum of the $\Delta\Delta G_T^{\ddagger}$ values for the single 
mutants and those of the corresponding double mutant (Wells 
et al., 1987b). In almost all cases, the sum of the $\Delta\Delta G_T^{\ddagger}$ values 
for the single mutants is much greater than the value for the 
multiple mutant. Nonetheless, the $\Delta\Delta G_T^{\ddagger}$ value predicted from 
the sum of the single mutants does have the same sign as that 
for the double mutant, so that the single mutants predict 
qualitatively the effect on the multiple mutant. 

A plot (Figure 2) of the collective data set from Table II 
is in contrast to that seen in Figure 1. The $\Delta\Delta G_T^{\ddagger}$ values for 
the multiple mutants correlate more poorly with the sum of 



FIGURE 2: Data are taken from Table II for mutants of subtilisin (■) 
or tyrosyl-tRNA synthetase (○) where mutant or wild-type side chains 
can contact each other. The dashed line represents a theoretical line 
of unity slope, and the solid line represents the best fit. 

the component single mutants ($R^2$ = 0.72). Moreover, the 
slope of the line (0.61) is much below unity. This indicates 
that the function of one residue is compromised by mutation 
of another. Of the 40 additivity examples, the average contribution 
of the most dominant single mutant to the sum of 
the $\Delta\Delta G_T^{\ddagger}$ values is 71% (±13%) of the total. Thus (as in 
Figure 1), both single mutants can contribute substantially 
to free energy changes measured in the multiple mutant. 
However, this data set is derived from mutations at only two 
different sites on two different proteins. 
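
The slope and correlation quoted for plots of this kind come from an ordinary least-squares fit of the multiple-mutant value against the sum of the component values. The sketch below shows one way to compute them in plain Python. The function name is an assumption, and the eight data pairs are only the E156S + G166N and E156S + G166K rows of Table II, so the output is not expected to reproduce the values reported for the full data set.

```python
def fit_line(x, y):
    """Least-squares fit of y = slope * x + intercept; also returns R^2."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# (sum of component mutants, multiple mutant) in kcal/mol, taken from the
# E156S + G166N and E156S + G166K rows of Table II (a subset only).
pairs = [
    (-1.55, -0.51), (-0.45, -0.85), (-0.67, -0.78), (2.16, 1.26),
    (-4.93, -4.49), (-1.62, -0.95), (-2.22, -1.12), (2.19, 1.88),
]
sums = [p[0] for p in pairs]
multiples = [p[1] for p in pairs]

slope, intercept, r_squared = fit_line(sums, multiples)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, R^2 = {r_squared:.2f}")
```

A slope below unity means the multiple mutant changes less than the component sum predicts, which is the signature of complex additivity described above; perfectly additive data would give a slope and $R^2$ of 1.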

In summary, complex additivity can be observed when 
mutations at sites X and Y change the intramolecular inter- 
action energy between sites. This can be mediated by direct 
steric, electrostatic, hydrogen-bonding, or hydrophobic in- 
teractions or indirectly through large structural changes in the 
protein, solvent shell, or electrostatic interactions. Complex 
additivity is most likely to occur where the sites of mutation 
are very close together and larger or chemically divergent side 
chains are introduced. 

(B) Mutations at Sites X or Y Change the Enzyme 
Mechanism or Rate-Limiting Step. If the catalytic functions 
of two or more residues are interdependent, then a mutation 
of one residue can affect the functioning of the other(s). This 
form of complex additivity is well illustrated for mutations in 
the catalytic triad and oxyanion binding site of subtilisin 
(Carter & Wells, 1988, 1990). In the catalytic mechanism 
of subtilisin (Figure 3), the rate-limiting step in amide bond 
hydrolysis is transfer of the proton from Ser221 to His64 with 
nucleophilic attack upon the scissile carbonyl carbon. This 
is accompanied by electrostatic stabilization of the protonated 
imidazole by Asp32 and hydrogen bonding to the oxyanion 
by the side chain of Asn155 and the main-chain amide of 
Ser221. Mutational analysis shows that once the catalytic 
Ser221 is mutated to Ala (S221A), additional mutations in 
the triad or oxyanion binding site cause no further loss in 
catalytic efficiency (Table III). 

The S221A enzyme retains a catalytic activity that is still 
10* above the solution hydrolysis rate (Carter & Wells, 1988). 
It is proposed that this residual activity is derived from re- 
maining transition-state binding contacts outside of the cat- 
alytic triad coupled with solvent attack upon the carbonyl 
carbon from the face opposite position 221 (Carter & Wells, 
1990). This proposal is based on a model showing that there 
is no room for a water molecule near Ala221 once the substrate 
is bound. Furthermore, conversion of Asn155 to Gly enhances 
the activity of the S221A mutant by -1.2 kcal/mol (Table III). 
Table III: Comparison of Sums of $\Delta\Delta G_T^{\ddagger}$ from Component Mutants 
vs the $\Delta\Delta G_T^{\ddagger}$ for Multiple Mutants in the Catalytic Triad and 
Oxyanion Binding Site of Subtilisin BPN' 

FIGURE 3: Schematic diagram of the mechanism … Reproduced 
with permission from Carter and Wells (1988). Copyright 1988 Macmillan. 

activity could not go beyond the diffusion-controlled limit 
(Albery & Knowles, 1976). 

Additive Effects on Substrate Binding 

The analysis above considered changes in binding free energies 
between the free enzyme and substrate (E + S) to yield 
the bound transition-state complex (E·S‡). The steady-state 
kinetic analysis for subtilisin and tyrosyl-tRNA synthetase is 
such that the $K_M$ values approximate the enzyme-substrate 
dissociation constant $K_s$. Additivity analysis based on calculations 
of $\Delta\Delta G_{binding}$ (from $K_M$ values) or $\Delta\Delta G_{cat}$ (from $k_{cat}$ 
values) yields qualitatively the same results (not shown) as 
shown in Tables I and II and Figures 1 and 2. Thus, deviations 
from simple additivity are not systematically found in either 
[…] in the E·S complex or those to reach E·S‡. 



S221A + H64A» 
+8.93 +8.84 
S221A + D32A 
+8.93 +«-S2 

H64A + D32A 
+8.B4 +6J2 

S22IA + H64A + D32A 
+8.93 +8.84 +4.52 



+8.83 
+8.8« 



S221 * * D32A 
+8.93 +7.48 
H(UA + S22IA 
HMA + W2A 

+8.84 +8.8« 

+6J2 +8.83 
S221A + N155G' 
+8.93 +3.08 



All enzymes were assayed with the substrate succinyl-L-Ala-L-Ala- 
L-Pro-L-Phe-p-nitroanilide. Carter and Wells (1988). Carter and 
Wells (1990). 

This is consistent with the opposite-face solvent attack 
mechanism of S221A, because the oxyanion (Figure 3) would 
develop away from Asn155 and the N155G mutation improves 
solvent accessibility to the scissile carbonyl carbon. 

Complex additivity is also seen for subtilisin mutated at 
positions 64 and 32. The double (H64A.D32A) and corresponding 
single mutants show a linear dependence upon hydroxide 
ion concentration (between pH 8 and 10) that may 
reflect hydroxide assistance in the deprotonation of the Oγ 
of Ser221 (Carter & Wells, 1988). Thus, once His64 is 
converted to Ala, Asp32 is a liability, presumably by elec- 
trostatic repulsion of hydroxide ion. [Note the -1 J kcal/mol 
improvement in $\Delta\Delta G_T^{\ddagger}$ for the double mutant (H64A.D32A) 
compared to H64A alone; Table III.] 

In summary, if an enzyme mechanism relies upon cooperative 
interaction between two or more residues, then multiple 
mutations within this subset can result in large values for 
$\Delta G_{T(I)}^{\ddagger}$. In fact, if the mechanism is changed substantially, 
residues that were a catalytic asset can become a liability. 
Simple additivity can also break down when one or more of 
the mutations cause a change in the rate-limiting step. In an 
extreme case, one may have a number of mutants in an enzyme 
that enhance the activity, but the cumulative enhancement of 


Additive Effects on Protein-Protein Interactions 

The first clear examples of additive binding effects caused 
by amino acid replacements in proteins were reported by 
Laskowski et al. (1983) and reviewed by others (Ackers & 
Smith, 1985; Horovitz & Rigbi, 1985). One hundred natural 
variants of a proteinase inhibitor, the ovomucoid third domain, 
have been isolated and sequenced from the eggs of different 
bird species (Empie & Laskowski, 1982; Laskowski et al., 
1987). This is a nested set of proteins because for any one 
of these avian inhibitors there is a close relative containing only 
one or a few amino acid substitutions. Moreover, the association 
constants ($K_a$) of these inhibitors with a variety of serine 
proteinases vary over an enormous range (10*-fbkl). Laskowski 
et al. (1983, 1989) have shown that the effect of a given residue 
replacement on $K_a$ is about the same irrespective of the inhibitor 
scaffold in which the replacement is made. 

In addition to ovomucoid, four additivity examples have been 
constructed from natural variants at the subunit interface of 
tetrameric hemoglobin (Ackers & Smith, 1985). Three additivity 
examples have been analyzed for interactions of hGH 
with its receptor (B. C. Cunningham and J. A. Wells, unpublished 
results) and one example for association of synthetic 
variants of the RNase S peptide with RNase S protein 
(Mitchinson & Baldwin, 1986). The entirety of this data set 
is not tabulated because much on the ovomucoid inhibitors 
and hGH is unpublished. Nonetheless, these researchers were 
kind enough to provide their data formatted so it could be 
These data consist of 91 

FIGURE 4: Plot showing the sum of changes in free energies of binding 
at protein-protein interfaces for component mutants versus the 
corresponding multiple mutant. Data represent interactions between 
ovomucoid third domain and various serine proteases (□) (R. Wynn 
and M. Laskowski, personal communication), regulatory interface 
of α2β2 hemoglobin (●) (Ackers & Smith, 1985), hGH and its receptor 
(stippled △) (B. Cunningham and J. Wells, personal communication), 
and RNase S peptide and S protein (■) (Mitchinson & Baldwin, 
1986). The dashed line represents a line of unity slope, and the solid 
line is the best fit. 

kcal/mol). The plot shows a very strong linear correlation ($R^2$ 
= 0.96) with a slope near unity. Although the data for the 
ovomucoid were not sorted to evaluate changes at intramo- 
lecular contact sites, most are not expected to be in contact, 
and all of the other examples represent noncontact sites. Thus, 
the large data base derived from natural variants of ovomucoid 
third domain, as well as a smaller number of examples from 
several other proteins, indicates that multiple mutations at 
protein-protein interfaces commonly produce simple additive 
effects. 

Additive Effects in DNA-Protein Interactions 

One of the clear advantages in analyzing DNA-protein 
interactions is the ability to apply powerful selections that make 
analysis by random mutational studies feasible. Additivity 
in DNA-protein interactions was first demonstrated by reversion 
analysis of λ repressor (Nelson & Sauer, 1985). A 
mutation that decreased the binding affinity for the λ operator 
site (K4Q) was reverted by mutations at several second sites 
(E34K, G48S, and E83K). When these second-site revertants 
were introduced into wild-type λ repressor, they caused increases 
in affinity similar to those observed in the first-site 
suppressor mutant (K4Q). 

Functional independence for mutations at DNA-protein 
contacts has been demonstrated by additive effects for mutants 
of CAP (catabolite gene activator protein) and its operator 
sequence (Ebright et al., 1987) as well as lac repressor and 
its corresponding operator sequence (Ebright, 1986). Simple 
additivity of mutational effects in the operator sequences for 
Cro repressor (Takeda et al., 1989) and λ repressor (Sarai & 
Takeda, 1989) has been most systematically demonstrated. 
Simple additivity has also been reported for multiple mutations 
in the lac repressor (Lehming et al., 1990). In fact, simple 
additivity is so predictable in DNA-protein interactions that 
the observation of complex additivity has been used to predict 
specific DNA-protein contacts in the lac repressor-operator 
complex (Ebright, 1986). 

Additive Effects on Protein Stability 

The first systematic analysis of additive effects of site- 
specific mutations on protein stability was reported by Shortle 
and Meeker (1986). Five multiple mutants in staphylococcal 
Table IV: Comparison of Sums of $\Delta\Delta G_{unfolding}$ from Component 
Mutants vs the Multiple Mutant 

mutants                  assay            component mutants      sum      multiple mutant 

Staphylococcal Nuclease (a) 
V66L + G79S              GuHCl            -0.2  -2.6             -2.8     -3.3 
                         urea             +0.2  -2.9             -2.7     -3.6 
V66L + G88V              GuHCl            -0.2  -1.0             -1.2     -2.1 
                         urea             +0.2  -0.9             -0.7     … 
I18M + A69T              …                -0.6  -2.7             -3.3     … 
                         …                -0.7  -2.9             -3.6     … 
I18M + A90S              …                -0.6  -1.4             -2.0     … 
                         …                -0.7  -1.4             -2.1     -2.2 
V66L + G79S + G88V       …                -0.2  -2.6  -1.0       -3.8     … 
                         …                +0.2  -2.9  -0.9       -3.6     … 

N-Terminal Domain of λ Repressor (b) 
G46A + G48A              …                +0.7  +0.9             +1.6     … 

T4 Lysozyme (c) 
I3C + C54V               …                +1.2  -0.7             +0.5     … 
I3C + C54T               thermal melt     +1.2  +0.3             +1.5     +1.5 
I3C + C54T + R96H        thermal melt     +1.2  +0.3  -2.8       -1.3     … 
I3C.C54T + R96H          thermal melt     +1.5  -2.8             -1.3     -2.5 
I3C + C54T + A146T       thermal melt     +1.2  +0.3  -1.5       0        … 
I3C.C54T + A146T         thermal melt     +1.5  -1.5             0        -0.5 

Bacteriophage f1 Gene V (d) 
V35I + I47V              GuHCl            -0.4  -2.4             -2.8     -2.9 

Kringle-2 of tPA (e) 
H64Y + R68C              thermal melt     +2.9  +0.7             +3.6     +3.4 

Turkey Ovomucoid Third Domain (f) 
G32A + N28S              thermal melt     +0.8  -0.5             +0.3     +0.2 
Y20H + N45-CHO           thermal melt     -0.8  +0.3             -0.5     -0.6 

α Subunit of E. coli Trp Synthetase (g) 
Y175C + G211E            GuHCl            -0.1  +0.3             +0.2     -1.3 

(a) Shortle and Meeker (1986). (b) Hecht et al. (1986). (c) Wetzel et al. 
(1988). (d) Sandberg and Terwilliger (1989). (e) R. Kelley, personal 
communication. (f) Otlewski and Laskowski (1990). N45-CHO refers 
to a glycosylation of Asn45. (g) Hurle et al. (1986). 



nuclease were constructed from a group of random single 
mutants that were screened initially for their ability to affect 
the stability of the enzyme in vivo. The component mutants 
do not make direct contact with each other in the multiple 
mutants. Generally, these variants exhibit nearly additive 
effects except for the double mutant V66L.G88V (Table IV). 
In addition to those of staphylococcal nuclease, additive effects 
on the $\Delta\Delta G_{unfolding}$ (assayed by reversible denaturation) have 
also been determined for the N-terminal domain of λ repressor 
(one example; Hecht et al., 1986), the α-subunit of E. coli Trp 
synthetase (one example; Hurle et al., 1986), T4 lysozyme (six 
examples; Wetzel et al., 1988), the gene V product of bacteriophage 
f1 (one example; Sandberg & Terwilliger, 1989), 
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Z&aG^oo^ components 

FIGURE 5: Plot showing sum of changes in free energy of unfolding 
of component mutants and resulting multiple mutant. Data are taken 
from Table IV and represent staphylococcal nuclease (■), N-terminal 
domain of \ repressor (O), T4 lysozyme (D), bacteriophage fl gene 
V product (•), Kringle-2 domain of tissue plasminogen activator (a), 
turkey ovomucoid third domain (A), and the a-subunit of Trp 
synthetase (V). The dashed line represents a theoretical line of unity 
slope, and the solid line represents the best fit. 

natural variants of ovomucoid third domain (two examples; 
Otlewski & Laskowski, 1 990), and the Kringle-2 domain of 
human tissue plasminogen activator (t-PA) (one example; R. 
Kelley, personal communication). 

Collectively, this data set gives a high linear correlation (R²
= 0.94) and slope near unity (Figure 5). The generally simple
additive behavior is somewhat surprising given the highly
cooperative nature of protein folding. There are discrepancies
in some of the additivity examples besides the staphylococcal
nuclease mutant (V66L,G88V). For example, the 1.5
kcal/mol discrepancy for the Y175C,G211E double mutant
in Trp synthetase (Table IV) is proposed to result from the
fact that these residues are in direct contact (Hurle et al.,
1986). Furthermore, proximity effects may account for the
large differences between the sum of the component mutants
and the multiple mutants for the α-helical double glycine
mutant G46A,G48A in λ repressor (Hecht et al., 1986), and
when combining R96H with the C3-C97 disulfide mutant in
T4 lysozyme (Wetzel et al., 1988). In contrast, an exchange
of two side chains that contact one another (V35I and I47V)
in the hydrophobic core of the gene V product of f1 phage
produced simple additive effects (Sandberg & Terwilliger,
1989; Table IV). It should be noted that this data base
exhibiting simple additivity may be biased for single mutants
that stably fold, because severely unstable proteins are more
difficult to express.
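
The additivity comparison behind Table IV is simple arithmetic, and a short sketch may make it concrete. The Python below (an illustrative choice of language; the function and variable names are hypothetical) sums the component ΔΔG_unfold values for a few rows of Table IV whose multiple-mutant value is legible and reports the deviation from simple additivity. The subset is too small to reproduce the reported correlation.

```python
# Rows transcribed from Table IV (kcal/mol):
# (label, component values, measured multiple-mutant value).
TABLE_IV_SUBSET = [
    ("nuclease V66L+G79S (GuHCl)", [-0.2, -2.6], -3.3),
    ("nuclease V66L+G88V (GuHCl)", [-0.2, -1.0], -2.1),
    ("T4 lysozyme I3C+C54T", [1.2, 0.3], 1.5),
    ("gene V V35I+I47V", [-0.4, -2.4], -2.9),
    ("Kringle-2 H64Y+R68C", [2.9, 0.7], 3.4),
    ("Trp synthetase Y175C+G211E", [-0.1, 0.3], -1.3),
]


def additivity_deviation(components, multiple):
    """Return (sum of components, multiple minus sum), both in kcal/mol."""
    total = sum(components)
    return total, multiple - total


for label, components, multiple in TABLE_IV_SUBSET:
    total, deviation = additivity_deviation(components, multiple)
    print(f"{label:30s} sum={total:+.1f} "
          f"multiple={multiple:+.1f} deviation={deviation:+.1f}")
```

A deviation near zero corresponds to simple additivity; the Trp synthetase row shows the 1.5 kcal/mol discrepancy discussed in the text.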

By analogy to transition-state binding effects, one can
certainly imagine instances where the stabilizing effects of
mutations should reach a plateau. For example, denaturation
at high temperatures can become controlled by a chemical step
such as deamidation (Ahern et al., 1987), so that additional
mutants that stabilize the folded form of the protein may be
irrelevant. Another obvious example where complex additivity
can be observed in protein stability is the stabilizing effect of
disulfide bonds and noncovalent intramolecular contacts that
require interactions between two or more residues. In these
cases, the stabilizing interaction between two side chains can
be broken with only one mutation.

Applications of Additivity in Rational Protein 
Design 

A strategy of additive mutagenesis, where a series of single
mutants each making a small improvement in function are
combined, is one of the most powerful tools in designing
functional properties in proteins. This approach has been
remarkably successful in stabilizing proteins to irreversible
inactivation, such as λ repressor (Hecht et al., 1986), subtilisin
(Bryan et al., 1987; Cunningham & Wells, 1987; Pantoliano
et al., 1989), kanamycin nucleotidyltransferase (Liao et al.,
1986; Matsumura, 1986), neutral protease (Imanaka et al.,
1986), and T4 lysozyme (Wetzel et al., 1988; Matsumura et
al., 1989). This strategy has been applied to enhancing the
catalytic efficiency of a weakly active variant of subtilisin
(Carter et al., 1989), engineering the substrate specificity of
subtilisin (Wells et al., 1987a,b; Russell & Fersht, 1987) and
the coenzyme specificity of glutathione reductase (Scrutton
et al., 1990), designing protease inhibitors with exquisite
protease specificity (Laskowski et al., 1989), and recruiting
human prolactin to bind to the hGH receptor (Cunningham
et al., 1990). In addition, additivity principles have been used
to engineer the pH profile of subtilisin (Russell & Fersht,
1987) and to design the affinity and specificity of λ repressor
(Nelson & Sauer, 1985).

This approach does not require that all the component
mutants act in a simply additive manner, only that their
effects accumulate. For example, despite the complex
additivity of effects in the catalytic triad of subtilisin, there
are mutagenic pathways that are energetically cumulative for
installing the triad (Carter & Wells, 1988; Wells et al., 1987c).
Starting with the triple mutant S221A,H64A,D32A, there is
a progressive enhancement for installing Ser221 (-1.1
kcal/mol), then His64 (-1.0 kcal/mol), and finally Asp32 (-6.5
kcal/mol). Another cumulative pathway of Ser221, then
Asp32, and finally His64 is possible if the Ser221,Asp32
intermediate were to use His P2 substrates (Carter & Wells,
1987). Elaborating such cumulative pathways is important
for understanding how a catalytic apparatus may have evolved
and is practically useful for considering how to install such
catalytic machinery into weakly active catalytic antibodies.
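
The notion of an energetically cumulative pathway can be illustrated with a running total. The sketch below uses only the three stepwise values quoted above for the Ser221, His64, Asp32 pathway; the function name and the "every step is favorable" test are illustrative assumptions, not a definition taken from the original.

```python
from itertools import accumulate

# Stepwise free energy changes (kcal/mol) quoted in the text for installing
# the catalytic triad into the S221A,H64A,D32A triple mutant.
STEPS = [("Ser221", -1.1), ("His64", -1.0), ("Asp32", -6.5)]


def cumulative_pathway(steps):
    """Return ([(residue, running total)], True if every step is favorable)."""
    totals = list(accumulate(delta for _, delta in steps))
    is_cumulative = all(delta < 0 for _, delta in steps)
    return [(name, total) for (name, _), total in zip(steps, totals)], is_cumulative


pathway, is_cumulative = cumulative_pathway(STEPS)
for name, total in pathway:
    print(f"after installing {name}: {total:+.1f} kcal/mol")
print("cumulative:", is_cumulative)
```

The individual steps differ widely in size, so the pathway is cumulative without being simply additive in the sense used for Table IV.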

Conclusions 

In the majority of cases, combination of mutations that
affect substrate or transition-state binding, protein-protein
interactions, DNA-protein recognition, or protein stability
exhibits simple additivity. Simple additivity is commonly
observed for distant mutations at rigid molecular interfaces
such as in protein-protein and DNA-protein interactions,
where the mutations are unlikely to alter grossly the structure
or mode of binding.

Large deviations from simple additivity can occur when the
sites of mutations strongly interact with one another (by
making direct contact or indirectly through electrostatic
interactions or large structural perturbations) and/or when both
sites function cooperatively (as for the catalytic triad and
oxyanion binding site of subtilisin). Changes at sites that can
contact each other do not always lead to complex additivity;
this may reflect relatively weak interactions between the two
sites or indicate that the interactions are compensatory and
appear to be weak.

It is important to point out the magnitude of errors in
predicting the free energy effect in the multiple mutant from
the component single mutants. Generally, for those cases
exhibiting simple additivity (Figures 1, 4, and 5), the
discrepancy in free energy between the sums of the components
and multiple mutants is about ±25%. Part of this is the result
of compounding errors when summing the single mutants, and
the rest is presumably due to weak interaction terms.
Nonetheless, this means that if the total free energy change
is about 3 kcal/mol, the change in the equilibrium constant
(related by $K_1/K_2 = 10^{\Delta\Delta G/2.3RT} = 155$) will often be off by a
factor of 4. Thus, while the free energy effects accumulate,
significant deviations will occur in predicting the final
equilibrium constants when component mutants contribute a large
free energy term.
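
The conversion from a free energy change to a ratio of equilibrium constants can be checked numerically. The sketch below assumes 25 °C (298.15 K) and the standard relation ratio = exp(ΔΔG/RT); the temperature and the function name are assumptions not stated in the original. With these inputs, 3 kcal/mol gives a ratio in the range of 150 to 160, and a 25% error (0.75 kcal/mol) gives a factor between 3 and 4, in line with the figures quoted above.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K)


def fold_change(ddg_kcal, temp_k=298.15):
    """Ratio of equilibrium constants for a free energy difference (kcal/mol)."""
    return math.exp(abs(ddg_kcal) / (R_KCAL * temp_k))


total = 3.0           # kcal/mol, the example used in the text
error = 0.25 * total  # the ~25% discrepancy quoted for additive cases

print(f"fold change for {total} kcal/mol: {fold_change(total):.0f}")
print(f"uncertainty factor for +/-{error} kcal/mol: {fold_change(error):.1f}")
```

Because the relation is exponential, a fixed percentage error in free energy becomes a larger multiplicative error in the equilibrium constant as the total free energy grows.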

Simple additivity reflects the modularity of component 
amino acids in protein function. This results from the fact 
that the perturbations in energetics and structure resulting 
from most mutations are highly localized. In the past six years, 
an additive mutagenesis strategy has been extremely effective 
in engineering proteins — of course, nature has been using this 
strategy much longer. 
ABSTRACT: Femtosecond spectroscopy was used in combination with site-directed mutagenesis to study the
influence of tyrosine M210 (YM210) on the primary electron transfer in the reaction center of Rhodobacter
sphaeroides. The exchange of YM210 to phenylalanine caused the time constant of primary electron transfer
to increase from 3.5 ± 0.4 ps to 16 ± 6 ps while the exchange to leucine increased the time constant even
more to 22 ± 8 ps. The results suggest that tyrosine M210 is important for the fast rate of the primary
electron transfer.



The primary photochemical event during photosynthesis of
bacteriochlorophyll- (Bchl-) containing organisms is a
light-induced charge separation within a transmembrane protein
complex called the reaction center (RC). The crystal
structures of RC's from Rhodopseudomonas (Rps.) viridis and
Rhodobacter (Rb.) sphaeroides have been solved to high
resolution [reviewed in Deisenhofer and Michel (1989), Chang
et al. (1986), Tiede et al. (1988), and Rees et al. (1989)]. The
RC from Rb. sphaeroides contains three protein subunits
referred to as L, M, and H, according to their respective
mobilities in SDS-polyacrylamide gels. Associated with the
L and M subunits are the cofactors, consisting of four Bchl
a, two bacteriopheophytin (Bph) a, one atom of non-heme
ferrous iron, two quinones (Q_A and Q_B), and in some species
one carotenoid [reviewed in Parson (1987) and Feher et al.
(1989)]. The cofactors are arranged in two branches (Figure
1) with an approximate C2 axis of symmetry. The kinetic data
support a model in which the primary electron transfer
proceeds after light absorption by the primary donor [a special
pair of Bchl referred to as P; reviewed in Kirmaier and Holten
(1987)]. The absorption of light generates the excited
electronic state P*, which has a lifetime of approximately 3 ps.
An electron is transferred from P along only one branch (the
so-called A-branch). It is generally accepted that after
approximately 3 ps the electron arrives at the Bph on the A-side
(H_A) and after 220 ps it reaches Q_A. The role of the accessory
Bchl located between P and H_A (referred to as B_A) has not
been definitely assigned. Recently, we have shown that at
room temperature an additional kinetic (τ = 0.9 ps) component
is detectable (Holzapfel et al., 1989). The spectral properties
and the kinetic constants lead to the conclusion that the
corresponding intermediate is the radical pair P+B_A-
(Holzapfel et al., 1990).
Additional intriguing points concerning the process of

[...] major achievement, but the meaning of the mass of accumulated
data is only just beginning to be unraveled. At first sight, the
task appears straightforward: locate the [...] previously
characterized [...] sign function by evolutionary inference; and
rationalize the function in structural terms using known or
model-derived structures. Given the quantity of data, the
procedures should be automated as much as possible.
The reality, of course, is not so simple. Attempts to decipher the
clues latent in genomic data are hampered because current methods
to predict genes in uncharacterized DNA are unreliable (and it is
not always clear what we mean by "gene") [...]

[...] ranging from 27,462 to 312,278. The methods used to arrive at
these numbers each involve different approximations and
extrapolations. Nevertheless, it is disturbing that the different
analytical approaches [...] gene counting is that even the
definition of a gene is unclear. Is it a heritable unit
corresponding to an observable phenotype? Or is it a packet of
genetic information that encodes a protein, or proteins? Or perhaps
one that encodes RNA? Must it be translated? Are genes genes if
they are not expressed? As definitions vary [...]

[...] which are partial sequences of clones that are often error
prone, are housed in public and proprietary repositories. These
numbers will snowball with the fruition of further genome projects.
By contrast, the number of unique protein structures is still less
than 2000. [...] do not know how many unique [...] it is imperative
that we focus on deciphering the structural, functional, and
evolutionary clues encoded in the language of biological sequences.
Two distinct analytical approaches have emerged. Pattern
recognition methods aim to detect similarity between sequences and
structures and infer related functions [...]




"gene finders" identified 95% of coding
nucleotides, but intron/exon structures
were correctly predicted for only about
40% of genes. The different methods failed
to find between 5% and 95% of genes, and


Hon ttut emerge (motifs) begin to provide kmctlonal etnas, for example, thi above motifs (Q may 
pmvtdlng different *g A ££^^£^£^ 
tbucturetmctslb- 
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function prediction through pattern il but Slrueure prediction methods rang. 
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ect fold, but remit! deteriorate beyond 
re* rmrteinto the query aquwee- -100 reiidtiee. Today, kaowledge-biied 
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^i^Be d« tVnctk^>k. cnmnlrie evffcm; know- the primary itructure of l 

■ -• - - 1 — ^— ' — » • ■< ■■' modules to different proteii 

'•t^o^f^rT.l.o complicate ft* j^y P«*»M ^^^^ 
n«rfloa Krianineat KBC ftisctiooj may (although membra™ protein, structures 
«SL b^^oW^»o^«^d^llc2- wiB be fllTicd to obt»fab«<^ d»«y «. 
» >dd- be reouaoani, wn» to cy.ulUK). We aunt keep to 

mind, however, thai structure alone wiU 



(0. 





field and botosr reflect Otologics! reality, tode- ondary structures), *s webJteeture (gross 


perform ^>°° " msloly B. eta.* Whore docs a -reasonably 

^W^rSTI^,^ ™cT^^3»^ goc7predktL fsjl fa dUa hlerercky, snd 




but ehetailiriryb* fact («). 



^b^artojrU|^duplicatk«tr*^^ Strucoa^sradlctto! 







and structures, it is helpful to define our
terms precisely and be honest about our
achievements. Otherwise, we will continue
to be baffled by paradoxical new prediction
methods that yield >[?]0% error rates. Gene
identification, structure prediction, and
[...] continue to be made in all [...]

Nature functions by integration, and the
adoption of a more holistic view of complex
biological systems is an essential next step
for bioinformatics. To get the most from
genomic data, we need to take account of
information on the regulation of gene
expression, metabolic pathways, and signaling
cascades. Proteins do not work in isolation but
are involved in interrelated networks.
Unraveling these networks and their
interactions will be vital to our understanding of
normal and pathologic cell development.

Conquering by Dividing

The average personal computer
spends much less than half a day
actually performing [...] computations.
Many users, concerned about the
vulnerability of expensive electronic
components to the constant
cycling of the power on and
off, leave their systems on
continuously. It is staggering
to imagine the enormous,
unused computing resources of
several million PCs left run-
[...] One popular [...]

[...] versions of the program for Macintosh and
Solaris systems are planned.

The benefits of the Popular Power



not accrue solely to the user whose
computer is used. The flexible nature of
Popular Power's design provides access for
businesses, scientists, and anyone with
massive computing projects to computing
power that is potentially far greater than
they would gain from a fixed piece of
hardware. Personal computer users might
be able to select which commercial job
to run through Popular Power Worker
depending on the return offered
by the originating contractor.
A key to the success of the
computing model is likely to
be the price Popular Power
demands for acting as the
interface between the computing project cre-



Popular Power Worker
Popular Power, Inc.
San Francisco, CA



and win hek.ua create aa Integrated map- into pieces that can be tsolyed on personal Popular Power Worker 
ping between grotype and r*Wrp±_ compters ^ j^**!*?^ _ . ^_.^^ Ia ?L fi 
. "OeoomlcFbaaed "drag 'discovery H 



heavily dependent on accurate
annotation. Toward this end, bioinformatics
will need to deliver highly integrated,
interoperable databases (and data
"warehouses") that allow the user to reason over [...]
and ultimately enable




SBTI. a company computer feeds pieces ' pre-release product. The remaining users 
"of large computing problems to net- are advised to wait at least for (he official 



lb operation. Popular Power* ap- 
proach differs, however, in providing, a 
variety of computing problems to work 
00. These include nonprofit projects with 
[...] does not yet yield all the answers, but a
future holistic approach should help fuse [...]

[...] Worker runs only on Windows and [...] system and is
officially in prerelease form. The preliminary status of the
software is readily apparent; numerous bugs, frequent crashes,
and difficulties in installation plague the program currently.
If information at the company Web site is accu- [...] in
Popular Power's computing model may find dealing with the
problems of the early release worth their while. Users of the
prerelease software are promised priority of access to
commercial computing jobs after the official version is
released. Popular Power Worker can be downloaded for free
from the company's Web site, and it installs [...]

Eyes on the Skies

The orbital space above Earth contains an astonishing
collection [...] man-made satellites. Tracking all these objects
is no small task. Liftoff [...] software tools to locate, track,
and identify Earth-orbiting satellites. At the Web site, three
programs are available: J-Pass (identifies satellites passing
overhead); J-Track [...] allows one to view satellites orbiting
Earth from a perspective far away in space).
From genes to protein structure and function:
novel applications of computational
approaches in the genomic era



Jeffrey Skolnick and Jacquelyn S. Fetrow



The genome-sequencing projects are providing a detailed 'parts list' of life. A key to comprehending this list is understanding
the function of each gene and each protein at various levels. Sequence-based methods for function prediction are inadequate
because of the multifunctional nature of proteins. However, just knowing the structure of the protein is also insufficient for
prediction of multiple functional sites. Structural descriptors for protein functional sites are crucial for unlocking the secrets
in both the sequence and structural-genomics projects.



The genome-sequencing projects [...]
detailed 'parts list' for life. Unfortunately, this list,
a portion of which represents the amino acid
sequence of all the proteins in a given genome, does
[...] of the DNA itself.
This is not unlike giving a child a list of parts
necessary to create a working automobile. Without the
necessary expertise, creating the final, working car from
just the initial parts list is a nearly impossible task.
Similarly, understanding how [...]

Obviously, the complete characterization of [...]
function is difficult but efforts are under way at all levels,
including cellular function. In this article, however,
we focus on identifying the biochemical function of a



What is a protein function?

After a genome is sequenced and its parts
list determined, the next goal is to understand the
function(s) of each part, including that of the proteins. What
do we mean by protein function, the focus of this article?

Function has many meanings. At one level, the
protein could be a globular protein, such as an enzyme,
hormone or antibody, or it could be a structural or
membrane-bound protein. Another level is its
biochemical function, such as the chemical reaction and
the substrate specificity of an enzyme. The regulatory
molecules or cofactors that bind to a protein are also
levels of biochemical function.

At the cellular level, the protein's function would
involve its interaction with other macromolecules and
the function and cellular location of such complexes.
There is also the protein's physiological function; that
is, in which metabolic pathway the protein is involved
or what physiological role it performs in the organism.
Finally, the phenotypic function is the role played by
the protein in the total organism, which is observed by



[...] is the most commonly
used function-prediction method. This robust
field is well developed and, in the interest of space
limitations, we will merely present a brief overview.
There are two main flavors of this approach: sequence



[...] Prosite, Blocks, Prints and Emotif. Both the
alignment and the motif methods are powerful but a
recent analysis has demonstrated their significant
limitations 11, suggesting that these methods will increasingly
fail as the protein-sequence databases become more [...]

An extension of these approaches that combines
protein-sequence with structural information has been
developed and some successes have been reported.
However, this method still applies the structural infor-
However, this method stiU applies the structural infor- 



In addition, proteins can gain and lose function
during evolution and may, indeed, have multiple [...]

In a sense, this is one long-term goal of 'structural
genomics' projects, which are designed to
determine all possible protein folds experimentally, just
as genome-sequencing projects are determining all
protein sequences 31. This is in contrast to traditional
structural-biology approaches, in which one knows the
protein's function first and only then, if the [...]



It is implicitly assumed that having the protein's
structure will provide insights into its function, thereby
furthering the goals of the human-genome-sequencing
project. However, knowing a protein's three-dimensional
structure is insufficient to determine its function
(Box 2). What we really need to analyse and predict the
multifunctional aspects of proteins is a method spe-



Active-site identification

In order to use a structure-based approach to function
prediction, one must identify the key residues
responsible for a given biochemical activity. For many years,
[...] in proteins are [...]. Taken to the [...]
could not only identify [...] same global fold and the same
activity but also proteins with similar functions but [...]



The validity [...] empirically by Nussinov and co-workers,
who showed that the active sites of eukaryotic serine
proteases, subtilisins and sulfhydryl proteases exhibit similar
structural [...] a recent modeling study of [...],
protein functional sites [...] found to be more conserved
than other parts of the protein models 11. Similarly, it has
been demonstrated that the catalytic triad of the α/β hydrolases
is structurally better conserved than other
histidine-containing triads 21. A comparison of the structure of the
hydrolase catalytic triad to other histidine-containing
triads shows a distinct bimodal distribution, while a
similar analysis done with a randomly selected triad shows
a unimodal distribution (Fig. 1).
Kasuya and Thornton generalized this example by
Kasuya and Thornton 1 * generalized this example by 



Box 1. Proteins are multifunctional



A common protein characteristic that makes functional analysis based
only on homology especially difficult is the tendency of proteins to be
multifunctional. For instance, lactate dehydrogenase binds NAD,
substrate and zinc, and performs a redox reaction. Each of these occurs
at different functional sites that are in close proximity and the
combination of all four sites creates the fully functional protein.

Other examples of multifunctional proteins are the nucleic-acid-binding
proteins. For instance, DNA regulatory proteins often contain a DNA-
[...] regulatory proteins; a classic example is RecA. The 3C rhinovirus
protease exhibits a proteolytic function as well as an RNA-binding
function. Transcription factors are also complex, multifunctional
proteins. It is becoming increasingly important to recognize each of
these different functions of gene products of a newly sequenced gene.

The serine-threonine-phosphatase superfamily is a prime example of
the difficulties of using standard sequence analysis to recognize the
multiple functions found in single proteins. This large protein family is
divided into a number of subfamilies, all of which contain an essential
phosphatase active site. Subfamilies 1, 2A and 2B exhibit 40% or more
sequence identity between them 53. However, each of these subfamilies
is apparently regulated differently in the cell and observation
suggests that there are different functional sites at which regulation can
occur. Because the sequence identity between subfamilies is so high,
standard sequence-similarity methods could easily misclassify new
sequences as members of the wrong subfamily if the functional sites
are not carefully considered, as was recently demonstrated.

These are but a few examples of the multifunctionality of proteins.
The recognition of this multifunctional nature is of critical importance
to the genomics field. Useful functional-annotation methods must
consider all of the specific functions in a given protein and will not just



Unfortunately, most of these methods require the
exact placement of atoms within protein backbones and
side chains, and so have not been shown to be relevant
to inexact predicted structures. Recently, however, we
described the production of fuzzy, inexact descriptors




of the protein. 



Identifying active sites in experimental [...]

Historically, several groups have attempted [...]
functional sites in proteins; these efforts [...]
directed at protein engineering or [...]
sites in places where they did not previously exist. This
has been successfully accomplished for several
metal-binding sites. However, highly accurate functional-site
descriptors of the backbone and side-chain [...]
required, fueling the belief that significant atomic detail [...]



Highly detailed residue side-chain descriptors of the
active sites of serine proteases and related proteins have
[...] to identify functional sites. [...]



[...] screen high-resolution [...]
protein database. In a dataset of 364 [...]
the FFF accurately identified all [...]
-oxidoreduc[...]



larger dataset of 1501 proteins, the FFF again accurately
identified all proteins with the active site. In addition,
it identified another protein, 1fjm, a serine-threonine
phosphatase. This result was initially discouraging but
[...] and clustering analysis



n of proteins. It will also highlight the fact that human

[...] tell you its function



Because proteins can have similar folds but different [...],
determining the structure of a protein may or may not tell [...]
thing about its function. The most well-studied example is the
[...] barrel enzymes, of which triose-phosphate isomerase (TIM) is the
archetypal representative. Members of this family have similar overall
structures but different functions, including different active sites,
substrate specificities and cofactor requirements.

Is this example common? Our own analysis of the 1997 SCOP
database shows that the five largest fold families are the
ferredoxin-like, the (α/β) barrels, the knottins, the immunoglobulin-like
and the flavodoxin-like fold families with 22, 18, 13, 9 and 9
subfamilies, respectively (Fig. I). In fact, 57 of the SCOP fold families
consist of multiple superfamilies. These data only show the tip of the
iceberg, because each superfamily is further composed of protein
families and each individual family can have radically different
functions. For example, the ferredoxin-like superfamily contains
families identified as Fe-S ferredoxins, [...]



[...] ubiquitous functions and structures that occur across a number
of families. The article provides a useful analysis of the confidence
with which structure and function can be correlated. Knowing the
protein structure by itself is insufficient to annotate a number of
functional classes and is also insufficient for annotating the specific
details of protein



[Figure I: bar chart; axis label "Number of superfamilies per fold
family"; labeled bars include triose-phosphate isomerase barrels,
flavodoxin-like and ferredoxin-like.]

[...] of superfamilies found in each SCOP fold family. [...]
distribution of SCOP (http://scop.mrc-lmb.cam.ac.uk/scop). For a more detailed



[...] However, structure prediction is far more difficult
for proteins that are not homologous to proteins with
known structure. At present, there are two approaches for
these sequences: ab initio folding and threading.



[...] and then attempts to assemble the native
structure. [...] method does not rely on a library of
[...] folds, it can be used to predict novel
folds. The recent CASP3 protein-structure-prediction
experiment (http://PredictionCenter.llnl.gov/CASP3)
involved the blind prediction of the structure of
proteins whose actual structure was about to be
experimentally determined. These results indicate that
considerable progress has been made. For helical and
α/β proteins with less than 110 residues, structures
were often predicted whose backbone
root-mean-square deviation (RMSD) from native ranged from
4-7 Å. Progress is being made with the β proteins, too,
although they remain problematic. Because ab initio
methods can identify novel folds, these methods could
be used to help to select sequences likely to yield novel
folds in experimental structural-genomics projects.
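
The backbone RMSD quoted for the CASP3 predictions can be illustrated with a minimal sketch. The function below assumes the two coordinate sets are already optimally superimposed and paired atom by atom; it does not perform the superposition step that a full structure comparison would need, and the coordinates shown are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equal-length lists of (x, y, z)
    coordinates that are assumed to be already superimposed."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared = sum(
        (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
        for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


# Hypothetical three-atom backbone fragment (coordinates in angstroms).
native = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
model = [(0.0, 3.0, 0.0), (3.8, 4.0, 0.0), (7.6, 5.0, 0.0)]
print(f"RMSD = {rmsd(native, model):.2f} A")
```

In this toy case the per-atom displacements are 3, 4 and 5 Å, so the RMSD falls near the lower end of the 4-7 Å range mentioned above.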

Another approach to tertiary-structure prediction is
threading. Here, for the sequence of interest, one
attempts to find the closest matching structure in a
library of known folds. Threading is applicable to
proteins of up to 500 residues or so and is much faster



be used to obtain novel folds. 



as 6e rued for aulommtic 



always) create inexact protein [...]
useful for identifying [...]
Using the ab initio structure-prediction program
MONSSTER, the tertiary structure of a glutaredoxin,

ic RMSD from the a 



the PFP for d 



ecdy folded 



observation alone is no longer adequate for identifying
all functional sites in known protein structures.

To date, the use of structure to identify function has
largely focused on high-resolution structures and highly
detailed descriptors of protein functional sites.
However, the creation of inexact descriptors for functional
sites opens the way to the application [...]
to inexact, predicted protein models. [...]
remains: how good does a model have to be in order
to use FFFs to identify its active sites?



FFF uniquely identified the active site in the correctly
folded structure but not in the incorrectly folded ones
(Fig. 2). This is a proof-of-principle demonstration that
inexact models produced by ab initio prediction of
structure from sequence can be used for the subsequent
prediction of biochemical function. Of course, improvements
in the method have to be made before such
i be done on a routine basis.
found. From 10% to 30% r
are made than by alternative sequence-based approaches;
similar results are seen for the α/β hydrolases. Surprisingly,
in spite of the fact that threading algorithms
have problems generating good sequence-to-structure
alignments, active sites are often accurately aligned,
even for very distant matches. This observation would
agree with the above experimental results indicating
that active sites are well conserved in protein structures.

Importantly, the false-positive rate when using struc- 
tural information is much lower than that found using 



detailed comparison of the FFF structural approach and
the Blocks sequence-motif approach (N. Siew et al.,
unpublished). In this study, the sequences in eight
genomes, including Bacillus subtilis, were analysed for
disulfide-oxidoreductase function using the disulfide-oxidoreductase
FFF, the thioredoxin Block 00194 and
the glutaredoxin Block 00195. If we assume that those
sequences identified by both the FFF and Blocks
are 'true positives', we find 13 such sequences in the
B. subtilis genome.

There is no experimental evidence validating all of
these 'true positives' and so they are more accurately
termed 'consensus positives'. In order to find these 13
'consensus positive' sequences, the FFF hits seven false
positives. On the other hand, Blocks hits 23 false
positives (Fig. 3). It was previously suggested that the
use of a functional requirement adds information to
threading and reduces the number of false positives 11.
These data, including the data shown in Fig. 3, validate
this claim on a genome-wide basis.
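
The comparison above can be made concrete with a short calculation. This is only a sketch: it treats the 13 'consensus positives' as correct hits and the remaining hits as false positives, exactly as the text assumes, and the function name `hit_precision` is illustrative.

```python
def hit_precision(consensus_positives, false_positives):
    """Fraction of reported hits that are consensus positives."""
    return consensus_positives / (consensus_positives + false_positives)

# Counts quoted in the text for the B. subtilis genome.
fff = hit_precision(13, 7)      # FFF: 13 consensus positives, 7 false positives
blocks = hit_precision(13, 23)  # Blocks: 13 consensus positives, 23 false positives

print(f"FFF:    {fff:.2f}")
print(f"Blocks: {blocks:.2f}")
```

Because the consensus positives are not experimentally validated, these fractions describe agreement between the two methods rather than true accuracy.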

Of course, as no genome has had the function of all
of its proteins experimentally annotated, it is impossible
to know how many other proteins with the speci-



This is a critical question for researchers attempting to
predict protein function. E 
will be needed to validate 
fully. This points out the need for closely coupling 




as may be a problem 
only be applied to a 



Also, the method can a



Is. Techniques that will fii 
to dels will be quite usefi 
ic protein analysis to the next step.



Weaknesses of using the sequence-to-structure-to-function
method of function prediction

Based on studies to date, the identification of enzymatic
activity requires a model in which the backbone
RMSD



n the core of the molecule than in the loops and « 



approaches ro protein- 
proved to be very useful, alter- 
assign the biochemical function 
whose function cannot be 
methods. One emerging 



paradigm. Such structures might be provided by stj 
' ' projects or by ftrucrure-pred" 

ctional assignment is made by screening
the resulting structure against a library of structural
is yellow and cyan balls, respectively. The FFF clearly
it site in the crystal structure of the glutaredoxin 1ego
i The FFF does not match to the active




assigned to a given protein [an issue of major importance,
because proteins are multifunctional (Box 1)]
and, ultimately, that having a structure can provide
deeper insight into the biological mechanism of protein
function and regulation. The disadvantages are that
one needs to have the protein's structure before a function
can be assigned and that the approach is limited to
those functions associated with proteins with at least
one solved structure, so that a functional-site descriptor
can be constructed.



-fod^thr ac 



site match in a library of descriptors for known protein
active sites. This is the first step in the long process of
using structure to assign all levels of function, a goal
that is made increasingly important with the emergence
of structural genomics. Based on the progress to date,
it is apparent that structure will play an important role
in the post-genomic era of biology.
Detailed descriptors will only work on the experimentally
determined, high-quality structures. Ideally,
however, the descriptors should work on both experimental
structures and the cruder models provided by
tertiary-structure-prediction algorithms.

bs oftuch an approach are that one need 
BLAST

Basic Local Alignment Search Tool 

Blast 2 sequences 

Protein Sequence (806 letters) 

Results for: lcl|43901 None (806 aa)

Query ID: lcl|43901
Description: None
Molecule type: amino acid
Query Length: 806

Subject ID: 43903
Description: None
Molecule type: amino acid
Subject Length: 1106

Program: BLASTP 2.2.22+

Reference 

Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, 
and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search 
programs", Nucleic Acids Res. 25:3389-3402. 
Reference - compositional score matrix adjustment 

Stephen F. Altschul, John C. Wootton, E. Michael Gertz, Richa Agarwala, Aleksandr Morgulis, Alejandro A. 
Schaffer, and Yi-Kuo Yu (2005) "Protein database searches using compositionally adjusted substitution
matrices", FEBS J. 272:5101-5109. 

Search Parameters

Search parameter name      Search parameter value

Program                    blastp
Word size                  3
Expect value               10
Hitlistsize                100
Gapcosts                   11,1
Matrix                     BLOSUM62
Filter string              F
Genetic Code               1
Window Size                40
Threshold                  11
Composition-based stats    2


Karlin-Altschul statistics



Params Ungapped Gapped 



Lambda 0.318991 0.267 
K 0.133935 0.041 

H 0.404134 0.14 
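
The Karlin-Altschul parameters listed above connect a raw alignment score to the bit score and E-value shown later in this report. The sketch below uses the standard relations S' = (lambda*S - ln K) / ln 2 and E = N * 2^(-S'), with the gapped lambda and K from this table, the raw score 151 of the top alignment, and the effective search space 809244 reported under Results Statistics. The function names are illustrative, and because compositional matrix adjustment and other corrections are applied by BLAST, weaker hits may not reproduce exactly.

```python
import math

def bit_score(raw_score, lam, k):
    """Convert a raw alignment score to a bit score (Karlin-Altschul)."""
    return (lam * raw_score - math.log(k)) / math.log(2)

def expect_value(bits, search_space):
    """Expected number of chance alignments scoring at least `bits`."""
    return search_space * 2.0 ** (-bits)

GAPPED_LAMBDA = 0.267   # from the table above
GAPPED_K = 0.041        # from the table above
SEARCH_SPACE = 809244   # effective search space reported for this run

bits = bit_score(151, GAPPED_LAMBDA, GAPPED_K)
e_value = expect_value(bits, SEARCH_SPACE)
print(f"{bits:.1f} bits, E = {e_value:.1e}")
```

For the top alignment this comes out close to the reported 62.8 bits and an E-value on the order of 1e-13.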



Results Statistics

Results Statistics parameter name    Results Statistics parameter value

Effective search space               809244

Graphic Summary

Distribution of 5 Blast Hits on the Query Sequence

An overview of the database sequences aligned to the query sequence is shown. The score of each
alignment is indicated by one of five different colors, which divides the range of scores into five groups.
Multiple alignments on the same database sequence are connected by a striped line. Mousing over a hit
sequence causes the definition and score to be shown in the window at the top, clicking on a hit sequence
takes the user to the associated alignments. New: This graphic is an overview of database sequences
aligned to the query sequence. Alignments are color-coded by score, within one of five score ranges. Multiple
alignments on the same database sequence are connected by a dashed line. Mousing over an alignment
shows the alignment definition and score in the box at the top. Clicking an alignment displays the alignment
detail.
Dot Matrix View

Plot of lcl|43901 vs 43903

This dot matrix view shows regions of similarity based upon the BLAST results. The query sequence is 
represented on the X-axis and the numbers represent the bases/residues of the query. The subject is 
represented on the Y-axis and again the numbers represent the bases/residues of the subject. Alignments 
are shown in the plot as lines. Plus strand and protein matches are slanted from the bottom left to the upper 
right corner, minus strand matches are slanted from the upper left to the lower right. The number of lines 
shown in the plot is the same as the number of alignments found by BLAST. 


Descriptions

Sequences producing significant alignments:    Score (Bits)    E Value

lcl|43903 unnamed protein product              62.8            1e-13



Alignments

>lcl|43903 unnamed protein product 
Length=1106 

Score = 62.8 bits (151), Expect = 1e-13, Method: Compositional matrix adjust.
Identities = 84/382 (21%), Positives = 153/382 (40%), Gaps = 49/382 (12%) 

Query 41 LTILANTTLQITCRGQRDLDWLWPNAQRDSEERVLVTECGGGDSIFCKTLTIPRWGNDT 100

L + ++T +TC G + W +R S+E D F LT+ + G DT 

Sbjct 42 LVLNVSSTFVLTCSGSAPWW ERMSQEPPQEM- AKAQDGTFSSVLTLTNLTGLDT 95 

Query 101 GAYKCSYRD VDIASTVTVYVRDYRSPFIASVSDQHGIVYITENKNKTVVIPCRGS 155 

G Y C++ D D +Y++V D F+ + +++ +++TE + IPCR + 

Sbjct 96 GEYFCTHNDSRGLETDERKRLYIFVPDPTVGFLPNDAEEL-FIFLTEITE - - ITIPCRVT 152 

Query 156 ISNLNVSLCARYPEKRFVPDGNRISVfDSEIGFTLPSYMISYAGMVFCEAKINDETYQSIM 215 

L V+L + + + +D + GF+ SY C+ ID S 

Sbjct 153 DPQLWTLHEKKGDVAL PVPYDHQRGFSGIFEDRSY I C KTT I GDREVDS DA 203 

Query 216 YIWWGYRIYDVILSPPHEIELSAGEKLVLNCTARTELNVGLDFTWHSPPSKSHHKKIV 275 

YV+ +V++ ++GE+LC N ++F W P +K 
Sbjct 2 04 YYVYRLQVS S INVSVNAVQTV - VRQGENITLMCI VI G - - NEWNFEWTYP RKES 254 

Query 276 NRDVKPFPGTVAKM FLSTLTIESVTKSDQGEYTCVASSGRMIKRNRTFVRVHT- -KP 330 

R V+P +M SLIS DG YTC + ++ + + 

Sbjct 255 GRLVEPVTDFLLDMPYHIRSILHIPSAELEDSGTYTCNVTESVNDHQDEKAINITVVESG 314 

Query 331 FIAFGSGMKSLVEATVGSQVRIPVKYLSYPAPDIKWYRNGRPIESNYTMIVG 382 

++ + +L A + + V + +YP P + W+++ R + + + 

Sbjct 315 YVRLLGEVGTLQFAELHRSRTLQWFEAYPPPTVLWFKDNRTLGDSSAGEIALSTRNVSE 374 

Query 383 -" DELTIMEVTERDAGNYTV 4 00 

ELT++ V +AG+YT+ 
Sbjct 375 TRYVS ELTLVRVKVAEAGHYTM 396 
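
The percentages in the header of the alignment above (Identities = 84/382 (21%), Positives = 153/382 (40%), Gaps = 49/382 (12%)) appear to be truncated to whole numbers rather than rounded. This is an inference from the figures printed in these reports, not a documented guarantee, and the helper name `blast_percent` is illustrative.

```python
def blast_percent(count, alignment_length):
    """Integer percentage, truncated (floor) rather than rounded."""
    return (100 * count) // alignment_length

ALIGNMENT_LENGTH = 382  # aligned columns in the top-scoring alignment

for label, count in (("Identities", 84), ("Positives", 153), ("Gaps", 49)):
    pct = blast_percent(count, ALIGNMENT_LENGTH)
    print(f"{label} = {count}/{ALIGNMENT_LENGTH} ({pct}%)")
```

Plain rounding would give 22% identity and 13% gaps for the same counts, which is why the truncation reading fits the report better.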



Score = 40.8 bits (94), Expect = 4e-07, Method: Compositional matrix adjust. 
Identities = 45/189 (23%), Positives = 75/189 (39%), Gaps = 33/189 (17%) 

Query 563 ESVSLLCTADRNTFENLTWYKLGSQATSVHMGESLTPVCKNL-DALWKLNGTMFSNSTND 621 

E+ + +L+C NNW ++ G+PVLD+ + 

Sbjct 229 EN I TLM C I V I GNE WNF E WT Y PR KE S GRLVEPVTD FLLDMPYHIRS 274 

Query 622 I L I VAFQNASLQDQGDY VC S AQDKKTKKRHCLVKQL 1 1 LE RMAPMITGNLENQTTT 677 

1+ +A L+D G Y C+ + + + ++E R+ + G L+ 

Sbjct 275 --ILHIPSAELEDSGTYTCNVTESVNDHQDEKAINITVVESGYVRLLGEV-GTLQFAELH 331 

Query 678 IGETIEVTCPASGNPTPHITWFKDNETLVEDSGIVLRDGNRN LTIRRVRKE 728 
Sbjct 332 RSRTLQWFEAY-

Query 729 DGGLYTCQA 737

+ G YT +A 
Sbjct 390 EAGHYTMRA 398 



Score = 20.4 bits (41), Expect = 0.52, Method: Compositional matrix adjust. 
Identities = 14/67 (20%), Positives = 28/67 (41%), Gaps = 5/67 (7%) 

Query 624 I VAFQNASLQDQGD Y VCSAQDKK TKKRHCLVKQLI I LERMAPM ITGNL ENQTTT I GE 680 

++ N + D G+Y C+D+ T+RL +++ ++E +E 

Sbjct 84 VLTLTNLTGLDTGEYFCTHNDSRGLETDERKRLY - - IFVPDPTVGFLPNDAEELFIFLTE 141 

Query 681 TIEVTCP 687 
E + T P 

Sbjct 142 ITEITIP 148 



Score = 20.4 bits (41), Expect = 0.59, Method: Compositional matrix adjust. 
Identities = 7/15 (46%), Positives = 8/15 (53%), Gaps = 0/15 (0%) 

Query 684 VTCPASGNPTPHITW 698

V C G P P+I W 
Sbjct 434 VRCRGRGMPQPNIIW 448 



Score = 17.3 bits (33), Expect = 4.7, Method: Compositional matrix adjust. 
Identities = 8/32 (25%), Positives = 14/32 (43%), Gaps = 0/32 (0%) 

Query 162 SLCARYPEKRFVPDGNRISWDSEIGFTLPSYM 193 

+ + +KR P S +G LPS++ 

Sbjct 699 TFLQHHSDKRRPPSAELYSNALPVGLPLPSHV 730 



Basic Local Alignment Search Tool

Blast 2 sequences 

Protein Sequence (806 letters) 

Results for: lcl|60337 None (806 aa)

Query ID: lcl|60337
Description: None
Molecule type: amino acid
Query Length: 806

Subject ID: 60339
Description: None
Molecule type: amino acid
Subject Length: 1091

Program: BLASTP 2.2.22+

Reference 

Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, 
and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search 
programs", Nucleic Acids Res. 25:3389-3402. 
Reference - compositional score matrix adjustment 

Stephen F. Altschul, John C. Wootton, E. Michael Gertz, Richa Agarwala, Aleksandr Morgulis, Alejandro A. 
Schaffer, and Yi-Kuo Yu (2005) "Protein database searches using compositionally adjusted substitution 
matrices", FEBS J. 272:5101-5109. 

Search Parameters 



Search parameter name Search parameter value 



Program blastp 

Word size 3 

Expect value 10 

Hitlistsize 100 

Gapcosts 11,1 

Matrix BLOSUM62 

Filter string F 



http://blast.ncbi.nlm.nih.gov/Blast.cgi 
Genetic Code 1

Window Size 40 

Threshold 11

Composition-based stats 2 

Karlin-Altschul statistics 



Params Ungapped Gapped 

Lambda 0.318991 0.267 
K 0.133935 0.041 

H 0.404134 0.14 

Results Statistics 



Results Statistics parameter name Results Statistics parameter value

Effective search space 799624 
Graphic Summary 

Distribution of 1 Blast Hits on the Query Sequence 






An overview of the database sequences aligned to the query sequence is shown. The score of each 
alignment is indicated by one of five different colors, which divides the range of scores into five groups. 
Multiple alignments on the same database sequence are connected by a striped line. Mousing over a hit 
sequence causes the definition and score to be shown in the window at the top, clicking on a hit sequence 
takes the user to the associated alignments. New: This graphic is an overview of database sequences 
aligned to the query sequence. Alignments are color-coded by score, within one of five score ranges. Multiple 
alignments on the same database sequence are connected by a dashed line. Mousing over an alignment 
shows the alignment definition and score in the box at the top. Clicking an alignment displays the alignment 






Color Key for alignment scores 






2/15/2010 






Dot Matrix View 




Plot of lcl|60337 vs 60339

This dot matrix view shows regions of similarity based upon the BLAST results. The query sequence is 
represented on the X-axis and the numbers represent the bases/residues of the query. The subject is 
represented on the Y-axis and again the numbers represent the bases/residues of the subject. Alignments 
are shown in the plot as lines. Plus strand and protein matches are slanted from the bottom left to the upper 
right corner, minus strand matches are slanted from the upper left to the lower right. The number of lines 
shown in the plot is the same as the number of alignments found by BLAST. 


Descriptions

Sequences producing significant alignments:    Score (Bits)    E Value

lcl|60339 unnamed protein product              16.5            7.7



Alignments

>lcl|60339 unnamed protein product 
Length=1091

Score = 16.5 bits (31), Expect = 7.7, Method: Compositional matrix adjust. 
Identities = 8/19 (42%), Positives = 8/19 (42%), Gaps = 0/19 (0%) 

Query 472 PGQTSPYACKEWRHVEDFQ 490

PQPAE VDQ 
Sbjct 1071 PSQVLPPASPEGETVADLQ 1089 



Basic Local Alignment Search Tool

Blast 2 sequences 

Protein Sequence (806 letters) 

Results for: lcl|40585 None (806 aa)

Query ID: lcl|40585
Description: None
Molecule type: amino acid
Query Length: 806

Subject ID: 40587
Description: None
Molecule type: amino acid
Subject Length: 820

Program: BLASTP 2.2.22+

Reference 

Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, 
and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search 
programs", Nucleic Acids Res. 25:3389-3402. 
Reference - compositional score matrix adjustment 

Stephen F. Altschul, John C. Wootton, E. Michael Gertz, Richa Agarwala, Aleksandr Morgulis, Alejandro A. 
Schaffer, and Yi-Kuo Yu (2005) "Protein database searches using compositionally adjusted substitution
matrices", FEBS J. 272:5101-5109. 

Search Parameters 



Search parameter name Search parameter value 

Program blastp 

Word size 3 

Expect value 10 

Hitlistsize 100 

Gapcosts 11,1 

Matrix BLOSUM62 

Filter string F 
Genetic Code

Window Size 40

Threshold 11

Composition-based stats 2


Karlin-Altschul statistics



Params Ungapped Gapped 



Lambda 0.318991 0.267 
K 0.133935 0.041 

H 0.404134 0.14 



Results Statistics

Results Statistics parameter name    Results Statistics parameter value

Effective search space               595935

Graphic Summary

Distribution of 12 Blast Hits on the Query Sequence

An overview of the database sequences aligned to the query sequence is shown. The score of each
alignment is indicated by one of five different colors, which divides the range of scores into five groups.
Multiple alignments on the same database sequence are connected by a striped line. Mousing over a hit
sequence causes the definition and score to be shown in the window at the top, clicking on a hit sequence
takes the user to the associated alignments. New: This graphic is an overview of database sequences
aligned to the query sequence. Alignments are color-coded by score, within one of five score ranges. Multiple
alignments on the same database sequence are connected by a dashed line. Mousing over an alignment
shows the alignment definition and score in the box at the top. Clicking an alignment displays the alignment
detail.

Color key for alignment scores
Dot Matrix View

Plot of lcl|40585 vs 40587

This dot matrix view shows regions of similarity based upon the BLAST results. The query sequence is 
represented on the X-axis and the numbers represent the bases/residues of the query. The subject is 
represented on the Y-axis and again the numbers represent the bases/residues of the subject. Alignments 
are shown in the plot as lines. Plus strand and protein matches are slanted from the bottom left to the upper 
right corner, minus strand matches are slanted from the upper left to the lower right. The number of lines 
shown in the plot is the same as the number of alignments found by BLAST. 




Descriptions

Sequences producing significant alignments:    Score (Bits)    E Value

lcl|40587 unnamed protein product              57.8            3e-12



Alignments

>lcl|40587 unnamed protein product 
Length=820



Query 634 DQGDYVCSAQDKKTKKRHCLVKQLIILERMA- -PMITGNLE-NQTTTIGETIEVTCPASG 690
D+G+Y C +++ H QL ++ER P++ L N+T +G +E C
Sbjct 224 DKGNYTCIVENEYGSINHTY--QLDWERSPHRPILQAGLPANKTVALGSNVEFMCKVYS 281

Query 691 NPTPHITWFKDNET LVEDSG IVLRDGNRN - LT I RRVRKEDGGLYTC 735
+P PHI W K E +++ +G+ D L +R V ED G YTC
Sbjct 282 DPQ PH IQ WLKHI EVNGS KIGPDNLPYVQI LKTAGVNTTDKEMEVLHLRNVS FEDAGEYTC 341

Query 736 QACNVLGCARAET-LFIIEGAQEKTN LEVIILVGTAVI 772
A N +G + L ++E + E + LE+II A +
Sbjct 342 LAGNSIGLSHHSAWLTVLEALEERPAVMTSPLYLEIIIYCTGAFL 386

Score = 41.6 bits (96), Expect = 2e-07, Method: Compositional matrix adjust.
Identities = 34/148 (22%), Positives = 62/148 (41%), Gaps = 8/148 (5%)



Query 325 RVHTKPFIAFGSGMKSLVEATVGSQ-VRIPVKYLSYPAPDIKWYRNGRPIESN YT 378
R+ P+ M+ + A ++ V+ P P ++W +NG+ + + Y
Sbjct 148 RMPVAPYWTSPEKMEKKLHAVPAAKTVKFKCPSSGTPNPTLRWLKNGKEFKPDHRIGGYK 207

Query 379 MIVGDELTIME-VTERDAGNYTVILTNPISMEKQSHMVSLVVNVPPQIGEKALISPMNSY 437
+ IM+ V D GNYT I+ N ++ + +V P + +A + +
Sbjct 208 VRYATWSIIMDSWPSDKGNYTCIVENEYGSINHTYQLDWERSPHRPILQAGLPANKTV 267

Query 438 QYGTMQTLTCTVYANPPLHHIQWYWQLE 465
G+ C VY++ P HIQW +E
Sbjct 268 ALGSNVEFMCKVYSD-PQPHIQWLKHIE 294



Score = 39.7 bits (91), Expect = 6e-07, Method: Compositional matrix adjust.
Identities = 36/152 (23%), Positives = 64/152 (42%), Gaps = 32/152 (21%) 



Query 293 TLTIESVTKSDQGEYTCVASSGRMIKRNRTFV RVHTKPFIAFGSGMKSLVEATVG 347
++ ++SV SD+G YTC+ + N T+ R +P + +G+ + +G
Sbjct 214 SI IMDSWPSDKGNYTCIVEN-EYGSINHTYQLDWERSPHRPILQ- -AGLPANKTVALG 270

Query 348 SQ VR I PVKYLSYPAPDI KWYRNGRP I E SNYTMI VGDELTIMEVTE 392
S V K S P P I+W ++ IE N + I D L +++ +
Sbjct 271 SNVEFMCKVYSDPQPHIQWLKH I EVNGS K I GPDNL PY VQ I LKTAG VNTTDKEMEVLH 327

Query 393 RDAGNYTVILTNPISMEKQSHMVSLV 418
DAG YT + N I + S ++++
Sbjct 328 LRNVSFEDAGEYTCLAGNSIGLSHHSAWLTVL 359



Query 679 GETIEVTCPASGNPTPHITWFKDNETLVED SGIVLRDGNRNLTIRRVRKEDGGLYTC 735 

+T++ CP+SG PP+WK++ D G+R +++V DG YTC 
Sbjct 171 AKTVKFKCPSSGT PNPTLRWLKNGKEFKPDHRIGGYKVRYATWSIIMDSWPSDKGNYTC 230 

Query 736 QACNVLG 742 
N G 

Sbjct 231 IVENEYG 237 



Query 364 IKWYRNGRPI-ESNYTMIVGDELTIMEVTERDAGNYTVILTNPISMEKQSHMVSLWNVP 422
I W R+G + ESN T I G+E+ + + D+G Y + ++P S VNV
Sbjct 64 INWLRDGVQLAESNRTRITGEEVEVQDSVPADSGLYACVTSSP SGSDTTYFSVNVS 119

Query 423 PQIGE KALI SPMNSYQYGTMQTLTCTVYANPPLHHIQWYWQLEEAC SYR 471
+ + +T P + YW E +
Sbjct 120 DALPSSEDDDDDDDSSSEEKETDNTKPNRMP VAPYWTSPEKMEKKLHAVPAAKTVK 175

Query 472 PGQTSPYACKEW-RHVEDFQGGNKIEVTKNQYALIEGKNKTVSTLVIQAANVS--AL 525
P +P W ++ ++F+ ++I K +YA ++++ + S
Sbjct 176 FKCPSSGTPNPTLRWLKNGKEFKPDHRIGGYKVRYA TWSI IMDSWPSDKGN 227

Query 526 YKCEAINKAGRGERVISFHVI -RGPEITVQPAAQPTEQ ESVSLLCTADRNTFENL 579
Y C N+ G V+ R P + A P + +V +C + + +
Sbjct 228 YTCIVENEYGS INHTYQLDWERS PHRPI LQAGLPANKTVALGSNVEFMCKVYSDPQPHI 287

Query 580 TWYKLGSQATSVHM GESLTPVCKNLDALWKLNGTMFSNSTNDILIVAFQNASLQDQG 636
WK H+ G+P NL+L +++++++NS+DG
Sbjct 288 QWLK HIEVNGSKIGP--DNLPYVQILKTAGVNTTDKEMEVLHLRNVSFEDAG 337

Query 637 DYVCSA 642
+Y C A
Sbjct 338 EYTCLA 343

Score = 24.6 bits (52), Expect = 0.021, Method: Compositional matrix adjust.
Identities = 10/36 (27%), Positives = 20/36 (55%), Gaps = 0/36 (0%)

Query 291 LSTLTIESVTKSDQGEYTCVASSGRMIKRNRTFVRV 326
+ L + +V+ D GEYTC+A + + + ++ V
Sbjct 323 MEVLHLRNVSFEDAGEYTCLAGNSIGLSHHSAWLTV 358

Score = 21.6 bits (44), Expect = 0.21, Method: Compositional matrix adjust.
Identities = 19/75 (25%), Positives = 29/75 (38%), Gaps = 12/75 (16%)

Query 49 LQITCRGQRD LDWLWPNAQRDSEERVLVTECGGGDSIFCKTLTIPRWGNDTGAYKC 105
LQ+ CR + D ++WL Q R +T G+ + + V D+G Y C
Sbjct 51 LQLRCRLRDDVQSINWLRDGVQLAESNRTRIT GEEV EVQDSVPADSGLYAC 101

Query 106 SYRDVDIASTVYVYV 120
+ T Y V
Sbjct 102 VTSSPSGSDTTYFSV 116

Query 696 ITWFKDNETLVEDSGIVLRDGNRNLTIRRVRKEDGGLYTC 735
IW+DLE+ R +++ D GLY C
Sbjct 64 INWLRDGVQLAESNRT- -RITGEEVEVQDSVPADSGLYAC 101

Query 247 NCTARTELNVGLDFTWHSPPSK 268
NCT EL + + WH+ PS+
Sbjct 722 NCT- -NELYMMMRDCWHAVPSQ 741

Score = 18.9 bits (37), Expect = 1.4, Method: Compositional matrix adjust.
Identities = 6/16 (37%), Positives = 8/16 (50%), Gaps = 0/16 (0%)

Query 468 CSYRPGQTSPYACKEW 483

C+ RP T P + W
Sbjct 19 CTARPSPTLPEQAQPW 34



Score = 18.1 bits (35), Expect = 2.2, Method: Compositional matrix adjust. 
Identities = 18/75 (24%), Positives = 29/75 (38%), Gaps = 12/75 (16%) 

Query 37 QKDILTILANTTLQITCRG QRDLDWLWPNAQRDSEERVLVTECGGGDSIFCKTLTI 92 

+K + + A T++ C L WL + + R+ GG + T +1 

Sbjct 162 EKKLHAVPAAKTVKFKCPSSGTPNPTLRWLKNGKEFKPDHRI GGYKVRYATWS I 215 

Query 93 - - PRWGNDTGAYKC 105 

W +D G Y C 
Sbjct 216 IMDSWPSDKGNYTC 230



Score = 16.5 bits (31), Expect = 6.2, Method: Compositional matrix adjust. 

Identities = 6/12 (50%), Positives = 8/12 (66%), Gaps = 0/12 (0%) 

Query 673 NQTTTIGETIEV 684

N+T GE +EV

Sbjct 77 NRTRITGEEVEV 88